Theory of Deep Learning IIb: Optimization Properties of SGD

Authors

  • Chiyuan Zhang
  • Qianli Liao
  • Alexander Rakhlin
  • Brando Miranda
  • Noah Golowich
  • Tomaso A. Poggio
Abstract

In Theory IIb we characterize, with a mix of theory and experiments, the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. This memo reports those parts of CBMM Memo 067 that are focused on optimization; the main reason is to straighten up the titles of the theory trilogy.
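As a toy illustration of this flat-minima conjecture (a sketch only, not the paper's experiments), the Python snippet below runs discretized Langevin dynamics on a one-dimensional loss with one sharp and one wide minimum of equal depth; the loss shape, temperature, and step size are arbitrary choices for the demonstration, and the point is simply that the iterates spend most of their time in the large-volume basin.

```python
# Toy sketch (not the paper's experiments): discretized Langevin dynamics on a
# one-dimensional loss with a sharp and a wide ("flat") minimum of equal depth.
# The loss shape, temperature, and step size below are illustrative choices only.
import numpy as np

def loss(x):
    # Sharp basin near x = -1, wide ("flat") basin near x = +2, both at depth 0.
    return np.minimum(10.0 * (x + 1.0) ** 2, 0.1 * (x - 2.0) ** 2)

def grad(x, eps=1e-5):
    # Central-difference gradient; sufficient for a 1-D toy example.
    return (loss(x + eps) - loss(x - eps)) / (2.0 * eps)

rng = np.random.default_rng(0)
eta, temperature, steps = 1e-3, 1.0, 200_000
x = rng.uniform(-3.0, 5.0)
near_sharp = near_flat = 0

for _ in range(steps):
    # Langevin update: gradient step plus Gaussian noise of variance 2 * eta * T.
    x = x - eta * grad(x) + np.sqrt(2.0 * eta * temperature) * rng.standard_normal()
    near_sharp += abs(x + 1.0) < 0.3
    near_flat += abs(x - 2.0) < 2.0

# The wide basin collects roughly an order of magnitude more visits: the
# stationary (Gibbs) measure weights equally deep minima by basin volume.
print("fraction near sharp minimum:", near_sharp / steps)
print("fraction near flat minimum: ", near_flat / steps)
```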

Similar articles

Musings on Deep Learning: Properties of SGD

We ruminate with a mix of theory and experiments on the optimization and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We dream an explanation of these...

Theory of Deep Learning III: Generalization Properties of SGD

In Theory III we characterize with a mix of theory and experiments the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an exp...

YellowFin and the Art of Momentum Tuning

Hyperparameter tuning is a big cost of deep learning, and momentum is a key hyperparameter of SGD and its variants; adaptive methods such as Adam do not tune momentum. The YellowFin optimizer is based on the robustness properties of momentum, auto-tunes the momentum and learning rate of SGD, and uses closed-loop momentum control for asynchronous training. In experiments on ResNet and LSTM models, YellowFin runs w...
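Since the snippet above centers on momentum as the hyperparameter being tuned, here is a minimal sketch of the classical heavy-ball momentum update that such tuners adjust; the learning rate and momentum values are placeholders, and nothing below implements YellowFin's actual auto-tuning rules.

```python
# Minimal sketch of heavy-ball momentum SGD; lr and mu are placeholder values,
# not values chosen by any auto-tuner such as YellowFin.
import numpy as np

def momentum_sgd(grad_fn, w, lr=0.01, mu=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = mu * v - lr * g   # accumulate an exponentially decaying velocity
        w = w + v             # move along the velocity instead of the raw gradient
    return w

# Toy quadratic objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_final = momentum_sgd(lambda w: w, np.array([5.0, -3.0]))
print(w_final)  # close to the minimizer at the origin
```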

Annealed Gradient Descent for Deep Learning

Stochastic gradient descent (SGD) has been regarded as a successful optimization algorithm in machine learning. In this paper, we propose a novel annealed gradient descent (AGD) method for non-convex optimization in deep learning. AGD optimizes a sequence of gradually improved smoother mosaic functions that approximate the original non-convex objective function according to an annealing schedul...
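For context, the general annealing idea (run gradient descent on a sequence of gradually sharper approximations of a non-convex objective) can be sketched as follows; Gaussian smoothing is used here purely as a stand-in and is not the paper's mosaic-function construction, and the objective, schedule, and step size are made up for the illustration.

```python
# Hedged sketch of annealed optimization: gradient descent on progressively
# less-smoothed approximations of a non-convex objective. Gaussian smoothing is
# a stand-in here, NOT the mosaic approximation proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy non-convex objective with several local minima.
    return x ** 2 + 2.0 * np.sin(5.0 * x)

def smoothed_grad(x, sigma, n=256, eps=1e-4):
    # Monte-Carlo estimate of the gradient of E_z[f(x + sigma * z)], z ~ N(0, 1).
    z = sigma * rng.standard_normal(n)
    return np.mean(f(x + z + eps) - f(x + z - eps)) / (2.0 * eps)

x = 3.0
for sigma in [2.0, 1.0, 0.5, 0.1, 0.0]:   # annealing schedule: less smoothing each stage
    for _ in range(300):                  # plain gradient descent on the current surrogate
        x -= 0.02 * smoothed_grad(x, sigma)
print("final x:", x, "f(x):", f(x))
# On this toy function the schedule typically steers the iterate into the
# deepest basin rather than a nearby shallow local minimum.
```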

When Does Stochastic Gradient Algorithm Work Well?

In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a fixed, large step size and propose a novel assumption on the objective function, under which this method has the improved convergence rates (to a neighborhoo...
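As a toy illustration of that behavior (the quadratic objective, noise level, and step size below are assumptions for the demo, not the paper's setting), fixed-step SGD contracts quickly toward the minimizer and then fluctuates in a neighborhood whose radius is set by the step size and the gradient noise.

```python
# Illustrative sketch: SGD with a fixed, relatively large step size on a noisy
# quadratic reaches a neighborhood of the minimizer fast, then plateaus there.
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])              # minimizer of f(w) = 0.5 * ||w - w_star||^2

def stochastic_grad(w, noise=0.5):
    # Exact gradient plus Gaussian noise, mimicking minibatch gradient noise.
    return (w - w_star) + noise * rng.standard_normal(2)

w = np.zeros(2)
step = 0.2                                  # fixed step size, never decayed
for t in range(1, 2001):
    w = w - step * stochastic_grad(w)
    if t % 500 == 0:
        print(f"iter {t:4d}  distance to minimizer = {np.linalg.norm(w - w_star):.3f}")
# The distance drops geometrically at first and then hovers at a noise floor
# determined by the step size and noise level: convergence to a neighborhood,
# not to the exact minimizer.
```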

Journal:
  • CoRR

Volume: abs/1801.02254

Publication date: 2017